A Simple Baseline for Discriminating Similar Languages
Abstract
This paper describes an approach to discriminating similar languages using word- and character-based features, submitted as the Queen Mary University of London entry to the Discriminating Similar Languages shared task. Our motivation was to investigate how well a simple, data-driven, linguistically naive method could perform, in order to provide a baseline by which more linguistically complex or knowledge-rich approaches can be judged. Using a standard supervised classifier with word and character n-grams as features, we achieved over 90% accuracy in the test; on fixing simple file handling and feature extraction bugs, this improved to over 95%, comparable to the best submitted systems. Similar accuracy is achieved using only word unigram features.
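As a rough illustration of the kind of pipeline the abstract describes, the sketch below extracts word unigrams and character n-grams and feeds them to a supervised classifier. Several things here are assumptions rather than details from the paper: the abstract says only "a standard supervised classifier", so the multinomial Naive Bayes model is a stand-in chosen because it fits in a few lines of standard-library Python; the n-gram orders (2 and 3), the names `extract_features` and `NaiveBayes`, and the four toy British/American English sentences are likewise hypothetical and are not drawn from the shared task data.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, char_n=(2, 3)):
    """Bag of word unigrams plus character n-grams (orders are an assumption)."""
    text = text.lower()
    feats = Counter()
    for word in text.split():
        feats["w:" + word] += 1
    for n in char_n:
        for i in range(len(text) - n + 1):
            feats[f"c{n}:" + text[i:i + n]] += 1
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing.

    A stand-in for the unspecified "standard supervised classifier".
    """

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = extract_features(text)
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        # Total feature mass per label, used in the smoothed denominator.
        self.totals = {lab: sum(c.values()) for lab, c in self.feat_counts.items()}
        return self

    def predict(self, text):
        feats = extract_features(text)
        n_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / n_docs)  # log prior
            denom = self.totals[label] + vocab_size
            for feat, freq in feats.items():
                if feat in self.vocab:  # features never seen in training are skipped
                    seen = self.feat_counts[label][feat]
                    score += freq * math.log((seen + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data (NOT from the DSL corpus collection).
train_texts = [
    "the colour of the theatre",
    "my favourite neighbour",
    "the color of the theater",
    "my favorite neighbor",
]
train_labels = ["en-GB", "en-GB", "en-US", "en-US"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("a colourful harbour"))
print(model.predict("a colorful harbor"))
```

Prefixing each feature with its type (`w:`, `c2:`, `c3:`) keeps word and character features in separate namespaces, so dropping the character loop gives a word-unigram-only variant of the same sketch.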
Similar resources
Experiments in Sentence Language Identification with Groups of Similar Languages
Language identification is a simple problem that becomes much more difficult when its usual assumptions are broken. In this paper we consider the task of classifying short segments of text in closely-related languages for the Discriminating Similar Languages shared task, which is broken into six subtasks, (A) Bosnian, Croatian, and Serbian, (B) Indonesian and Malay, (C) Czech and Slovak, (D) Br...
Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection
This paper presents the compilation of the DSL corpus collection created for the DSL (Discriminating Similar Languages) shared task to be held at the VarDial workshop at COLING 2014. The DSL corpus collection was merged from three comparable corpora to provide a suitable dataset for automatic classification to discriminate similar languages and language varieties. Along with the description of...
Distributed Representations of Words and Documents for Discriminating Similar Languages
Discriminating between similar languages or language varieties aims to detect lexical and semantic variations in order to classify these varieties of languages. In this work we describe the system built by the Pattern Recognition and Human Language Technology (PRHLT) research center Universitat Politècnica de València and Autoritas Consulting for the Discriminating between similar languages (DS...
Efficient Discrimination Between Closely Related Languages
In this paper, we revisit the problem of language identification with the focus on proper discrimination between closely related languages. Strong similarities between certain languages make it very hard to classify them correctly using standard methods that have been proposed in the literature. Dedicated models that focus on specific discrimination tasks help to improve the accuracy of general...
When Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages
We present the results of our participation in the VarDial 4 shared task on discriminating closely related languages. Our submission includes simple traditional models using linear support vector machines (SVMs) and a neural network (NN). The main idea was to leverage language group information. We did so with a two-layer approach in the traditional model and a multi-task objective in the neura...
Publication date: 2014